Proposal

A study on the Financial Health of Engagement, Ohio, USA.

Motivation of the Project

Engagement, Ohio, USA is a small town with huge potential and experiencing sudden growth. We aim to analyse the data and derive insights which will help to plan the budget utilization wisely and to develop the infrastructure of the town to keep up with the growth.

For our Visual Analytics Project we aim to tackle the task 3 of the Vast Challenge.

The Problems

Challenge 3:

Economic considers the financial health of the city. Over time, are businesses growing or shrinking? How are people changing jobs? Are standards of living improving or declining over time?

Consider the financial status of Engagement’s businesses and residents, and use visual analytic techniques to address these questions.

Problem 1: * Over the period covered by the dataset, which businesses appear to be more prosperous? Which appear to be struggling? Describe your rationale for your answers. Limit your response to 10 images and 500 words.

Loading the required packages

packages = c('tidyverse','ggdist','gghalves','ggthemes','hrbrthemes','ggridges','patchwork','zoo', 'ggrepel','ggiraph','lubridate','gganimate','scales', 'plotly','treemap')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Our Solution

Problem 1: (Shachi) Over the period covered by the dataset, which businesses appear to be more prosperous? Which appear to be struggling? Describe your rationale for your answers. Limit your response to 10 images and 500 words.

Dataset

Links to the dataset:

CheckinJournal.csv TravelJournal.csv

The code chunk below imports Restaurants.csv and TravelJournal.csv and Pubs.csv from the data folder into R by using read_csv() function of readr and saves it as Tibble data frame called travel and restaurants and pubs

restaurants <- read_csv("data/Restaurants.csv")
summary(restaurants)
  restaurantId       foodCost      maxOccupancy      location        
 Min.   : 445.0   Min.   :4.070   Min.   : 48.00   Length:20         
 1st Qu.: 783.5   1st Qu.:4.395   1st Qu.: 71.75   Class :character  
 Median :1122.0   Median :5.095   Median : 85.00   Mode  :character  
 Mean   :1123.5   Mean   :5.035   Mean   : 84.70                     
 3rd Qu.:1462.0   3rd Qu.:5.678   3rd Qu.:103.25                     
 Max.   :1805.0   Max.   :5.920   Max.   :119.00                     
   buildingId   
 Min.   : 27.0  
 1st Qu.:151.0  
 Median :294.5  
 Mean   :432.9  
 3rd Qu.:754.2  
 Max.   :991.0  
pubs <- read_csv("data/Pubs.csv")
summary(pubs)
     pubId        hourlyCost      maxOccupancy     location        
 Min.   : 442   Min.   : 6.417   Min.   :60.00   Length:12         
 1st Qu.: 780   1st Qu.: 9.725   1st Qu.:64.00   Class :character  
 Median :1118   Median :11.035   Median :69.50   Mode  :character  
 Mean   :1120   Mean   :10.866   Mean   :71.83                     
 3rd Qu.:1458   3rd Qu.:12.379   3rd Qu.:77.50                     
 Max.   :1800   Max.   :14.840   Max.   :96.00                     
   buildingId    
 Min.   :  29.0  
 1st Qu.: 237.0  
 Median : 495.5  
 Mean   : 484.8  
 3rd Qu.: 595.5  
 Max.   :1012.0  
travel <- read_csv("data/TravelJournal.csv")
summary(restaurants)

Data Wrangling

The Travel Journal contains financial transactions by a participant towards Work/Home Commute, Eating, Coming Back From Restaurant,Recreation (Social Gathering), Going Back to Home. We filter out the records related to Eating and Recreation (Social Gathering).

travel_filt <- travel[travel$purpose %in% c("Eating","Recreation (Social Gathering)"),]

Calculating Amount Spent

Calculating the total amount spent at the location as a difference of the starting balance and ending balance in the travel journal

travel_filt$amountSpent <- travel_filt$startingBalance -travel_filt$endingBalance
saveRDS(travel_filt, 'data/travel_filt.rds')
travel_filt <- readRDS('data/rds/travel_filt.rds')
head(travel_filt)
# A tibble: 6 × 11
  participantId travelStartTime     travelStartLocationId
          <dbl> <dttm>                              <dbl>
1            23 2022-03-01 05:20:00                   532
2           876 2022-03-01 05:50:00                    NA
3           902 2022-03-01 06:05:00                    NA
4           919 2022-03-01 06:00:00                    NA
5           154 2022-03-01 05:55:00                    NA
6           509 2022-03-01 06:00:00                    NA
# … with 8 more variables: travelEndTime <dttm>,
#   travelEndLocationId <dbl>, purpose <chr>, checkInTime <dttm>,
#   checkOutTime <dttm>, startingBalance <dbl>, endingBalance <dbl>,
#   amountSpent <dbl>

Data Visualization

Grouping the data by the travelEndLocationId which is equal to the restaurant ID or the pub ID.

travel_group = travel_filt %>%group_by(travelEndLocationId) %>%
  summarise(amountSpent = sum(amountSpent), 
            .groups = 'drop')%>%
  arrange(desc(amountSpent))
library(dplyr)
data_merge <-merge(x=travel_group, y=restaurants, by.x = 'travelEndLocationId', by.y =  'restaurantId')
data_merge$travelEndLocationId <- as.character(data_merge$travelEndLocationId)
data_merge$amountSpent <- data_merge$amountSpent/1000

Revenue for Restaurants

The restaurants highlighted in red are among those which had a revenue less than 50 thousand dollars over the period of time

library(plotly)

color = c('rgba(222,45,38,0.8)','rgba(204,204,204,1)','rgba(204,204,204,1)', 'rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)', 'rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(222,45,38,0.8)','rgba(222,45,38,0.8)','rgba(222,45,38,0.8)',
          'rgba(222,45,38,0.8)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)')

fig<- plot_ly(data_merge, x = ~reorder(`travelEndLocationId`, -amountSpent), y = ~amountSpent, type = 'bar', marker = list(color = color))

fig <- fig %>% layout(title = "Revenue for Restraunts",
         xaxis = list(title = "Retaurant ID"),
         yaxis = list(title = "Revenue (in thousands $)"))

fig

Revenue for Pubs

The restaurants highlighted in red are among those which had a revenue less than 300 thousand dollars over the period of time

data_pub <-merge(x=travel_group, y=pubs, by.x = "travelEndLocationId", by.y = "pubId")
data_pub$travelEndLocationId <- as.character(data_pub$travelEndLocationId)
data_pub$amountSpent <- data_pub$amountSpent/1000
library(plotly)
color = c('rgba(222,45,38,0.8)','rgba(222,45,38,0.8)','rgba(222,45,38,0.8)', 'rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)','rgba(204,204,204,1)', 'rgba(204,204,204,1)', 'rgba(204,204,204,1)', 'rgba(204,204,204,1)')
fig<- plot_ly(data_pub, x = ~reorder(`travelEndLocationId`, -amountSpent), y = ~amountSpent, type = 'bar', marker = list(color = color))

fig <- fig %>% layout(title = "Revenue for Pubs",
         xaxis = list(title = "Pub ID"),
         yaxis = list(title = "Revenue (in thousands $)"))

fig

Treemap for Retaurants based on amount spent

treemap(data_merge,
        index=c("travelEndLocationId"), 
        vSize="amountSpent",
        vColor="amountSpent",
        title="Amount Spent in thousands of Dollars - Restaurants",
        title.legend = "Amount Spent in thousands of Dollars - Restaurants"
        )

Revenue generated in 2022 over the months - Pubs

data_travel= travel_filt %>%
  mutate(weekday = weekdays(checkInTime),
         day = day(checkInTime),
         month=as.character(checkInTime,"%b %y"),
         year = year(checkInTime),
         monthYear = floor_date(checkInTime, "month"),
         travelEndLocationId=as.character(travelEndLocationId),
         timeSpent = checkOutTime - checkInTime,
         participantId=as.character(participantId),
         purpose=as.character(purpose))
data_travel$timeSpent <- as.numeric(as.character(data_travel$timeSpent))
data_travel <- data_travel[,c("participantId","travelStartLocationId", "travelEndLocationId", "purpose",  "amountSpent","timeSpent","weekday","day","month","year","monthYear")]

group_pub <-merge(x=data_travel, y=pubs, by.x = "travelEndLocationId", by.y = "pubId")
group_restaurant<-merge(x=data_travel, y=restaurants, by.x = 'travelEndLocationId', by.y =  'restaurantId')
group_pub = group_pub %>% 
  filter(year == 2022) %>%
  group_by(month, travelEndLocationId) %>%
  summarise(amountSpent = (sum(amountSpent)/1000))
group_restaurant = group_restaurant %>% 
  filter(year == 2022) %>%
  group_by(month, travelEndLocationId) %>%
  summarise(amountSpent = (sum(amountSpent)/1000))
p<- ggplot(group_pub, aes(x=month, y=amountSpent, group=travelEndLocationId)) +
  geom_line(aes(color=travelEndLocationId),show.legend = TRUE)+
  labs(
    y= 'Revenue (Thousands$)',
    x= 'months -2022',
    title = "Revenue Pubs - 2022",
    caption = "Ohio USA"
  ) +
  theme_minimal()+
  theme(axis.ticks.x= element_blank(),
        panel.background= element_blank(), 
        legend.background = element_blank(),
        plot.title = element_text(size=12, face="bold",hjust = 0.5),
        plot.subtitle = element_text(hjust = 1),
        plot.caption = element_text(hjust = 0),
        axis.title.y= element_text(angle=0))

ggplotly(p)

Revenue generated in 2022 over the months - Restaurants

p<- ggplot(group_restaurant, aes(x=month, y=amountSpent, group=travelEndLocationId)) +
  geom_line(aes(color=travelEndLocationId),show.legend = TRUE)+
  labs(
    y= 'Revenue (Thousands$)',
    x= 'months -2022',
    title = "Revenue Restaurants- 2022",
    caption = "Ohio USA"
  ) +
  theme_minimal()+
  theme(axis.ticks.x= element_blank(),
        panel.background= element_blank(), 
        legend.background = element_blank(),
        plot.title = element_text(size=12, face="bold",hjust = 0.5),
        plot.subtitle = element_text(hjust = 1),
        plot.caption = element_text(hjust = 0),
        axis.title.y= element_text(angle=0))

ggplotly(p)

`

Problem 2: (Rakendu) How does the financial health of the residents change over the period covered by the dataset? How do wages compare to the overall cost of living in Engagement? Are there groups that appear to exhibit similar patterns? Describe your rationale for your answers. Limit your response to 10 images and 500 words.

Dataset

The Financial Journal of the participants was used to derive insights about the financial health of the residents. Let us take a look at the data :

Financial Data

Data Wrangling

We use the dplyr package to group by participant id and date from timestamp to find the income and expenditure of the participants. The code can be seen here

We will read in the wrangled data saved as rds file in order reduce the size

participant_fin <- read_rds("data/rds/participant_fin.rds")

We can use a scatterplot to understand the variations in income vs expenses of participants over time.

participant_fin %>%
  filter(date >= 'Apr 2022') %>%
  transform(date = as.Date(date, frac = 1)) %>%
  ggplot(aes(x=income, y = abs(expense), size = savings, color = educationLevel))+
  geom_point(alpha=0.7) +
  ggtitle("Income vs Expense by different Education Levels") +
  ylab("Expense") +
  xlab("Income")+
  theme_minimal() +
  theme(axis.line = element_line(size = 0.5),
        axis.text = element_text(size = 16),
        axis.title = element_text(size=16),
        axis.title.y = element_text(angle = 0),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))+
  labs(title ='Period : {frame_time}')+
  transition_time(date)+
  ease_aes('linear')

From the above plot it can be seen that the the Groups with Low and High School education has lower income as well as lower variation in income (along x axis ) and lower variation in expense (along y axis). The participants with graduate and bachelors education has notably higher variation.

To understand the variations of different groups better, let us first aggregate by the education level.

participant_fin %>%
  filter(date >= 'Apr 2022') %>%
  transform(date = as.Date(date, frac = 1)) %>%
  group_by(educationLevel,date) %>%
  summarise(AvgIncome = mean(income), AvgExpense = mean(expense), AvgSavings = mean(savings)) %>%
  ggplot(aes(x=AvgIncome, y = abs(AvgExpense), size = AvgSavings, color = educationLevel))+
  geom_point(alpha=0.7) +
  ggtitle("Avg Income vs Avg Expense by different Education Levels") +
  ylab("Expense") +
  xlab("Income")+
  theme_minimal() +
  theme(axis.line = element_line(size = 0.5),
        axis.text = element_text(size = 16),
        axis.title = element_text(size=16),
        axis.title.y = element_text(angle = 0),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))+
  labs(title ='Period : {frame_time}')+
  transition_time(date)+
  ease_aes('linear')

This indeed shows that the participants with higher education has larger variations in their income and expenses.

In a similar way, we will also group the participants by age to understand variation in the income and expenses based on age groups.

participant_fin$agegroup <- cut(participant_fin$age, breaks = c(17,30,40,50,60), 
                             labels = c("18-30","30-40","40-50","50-60"))

participant_fin %>%
  filter(date >= 'Apr 2022') %>%
  transform(date = as.Date(date, frac = 1)) %>%
  group_by(agegroup,date) %>%
  summarise(AvgIncome = mean(income), AvgExpense = mean(expense), AvgSavings = mean(savings)) %>%
  ggplot(aes(x=AvgIncome, y = abs(AvgExpense), size = AvgSavings, color = agegroup))+
  geom_point(alpha=0.7) +
  ggtitle("Avg Income vs Avg Expense by different Education Levels") +
  ylab("Expense") +
  xlab("Income")+
  theme_minimal() +
  theme(axis.line = element_line(size = 0.5),
        axis.text = element_text(size = 16),
        axis.title = element_text(size=16),
        axis.title.y = element_text(angle = 0),
        legend.title = element_text(size =16),
        legend.text = element_text(size = 16),
        plot.title = element_text(size =20,hjust = 0.5))+
  labs(title ='Period : {frame_time}')+
  transition_time(date)+
  ease_aes('linear')

It is interesting to note that the age grooup 30-40 has the highest mean income as well as expense and the adjacent age group, 40-50 has the lowest.

The above 3 plots aim to answer the questions related to the financial health of residents of Engagement.

Problem 2: (Jeremiah) Describe the health of the various employers within the city limits. What employment patterns do you observe? Do you notice any areas of particularly high or low turnover? Limit your response to 10 images and 500 words.